Working with Nextstrain GenBank data

Nicholas G Reich

2022-02-09

Background on genomic data

Downloading Nextstrain-curated GenBank data

The Nextstrain project makes daily snapshots of GenBank data available for the US and the world. Specifically, the flat tab-separated file available at https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz is updated daily, typically around 11am or noon US Pacific time. This file is large (>350MB as of 2022-02-09). For the below, we assume that you have downloaded this file and unzipped it so the .tsv file can be read in directly.

A codebook for the fields in the dataset are available here.

Initial GenBank data exploration

Let’s start by loading some tidyverse packages that will be useful for us and then by reading in the dataset.

library(tidyverse)
library(lubridate)
theme_set(theme_bw())
genbank_global <- read_tsv("data/metadata.tsv")

This is a large file, with 3707198 rows. For starters, we create a version of these data that contain only US data, and only retains columns we are interested in. Further, we will reformat certain columns.

us_dat <- genbank_global %>% 
  filter(country=="USA") %>%
  mutate(date = ymd(date), 
         date_submitted = ymd(date_submitted),
         reporting_lag = as.numeric(date_submitted - date)) %>%
  select(strain, virus, Nextstrain_clade,  ## info on the virus
         region, country, division, location, ## info on location
         host, sampling_strategy, ## info about the sample
         date, date_submitted, reporting_lag) ## dates
us_dat %>%
  group_by(Nextstrain_clade) %>%
  summarize(n_samples = n()) 
Nextstrain_clade n_samples
19A 4445
19B 3966
20A 50509
20B 23174
20C 57723
20D 613
20E (EU1) 134
20G 62703
20H (Beta, V2) 2188
20I (Alpha, V1) 184305
20J (Gamma, V3) 20702
21A (Delta) 40794
21B (Kappa) 246
21C (Epsilon) 39149
21D (Eta) 1025
21E (Theta) 20
21F (Iota) 32890
21G (Lambda) 963
21H (Mu) 4970
21I (Delta) 89213
21J (Delta) 994549
21K (Omicron) 223836
21L (Omicron) 268
21M (Omicron) 24
NA 194

Note that the sampling_strategy field is empty for all US data.

us_dat %>%
  group_by(sampling_strategy) %>%
  summarize(n_samples = n()) 
sampling_strategy n_samples
? 1838603

The following figure shows the reporting lags by state as computed in the data as the difference between the date the sample was taken and the date_submitted, which is the date the sample was submitted to the GenBank system. There appears to be substantial variation by location (note the shorter lags in MA, CA and VT), with 75% of samples typically reported by 1 month out. Note this is subsetting to look at all data starting in August of 2021. Some analyses, for example the Rt analysis that the Bedford Lab has done, remove all samples from the last 10 days, to try to remove small sample effects in the early reporting.

us_dat %>%
  filter(date > ymd("2021-08-01")) %>%
  ggplot(aes(x=division, y=reporting_lag)) + 
  geom_boxplot() +
  scale_y_log10(name= "lag (days, log scale)", breaks=c(7, 14, 21, 30, 90, 360, 720)) +
  xlab(NULL) +
  theme(axis.text.x = element_text(angle=60, vjust=1, hjust=1))

Plot of samples over time

us_dat %>%
  group_by(date) %>%
  summarize(n_samples = n()) %>% 
  ggplot(aes(x=date, y=n_samples)) +
  geom_bar(alpha=.3, stat="identity") +
  geom_smooth(span=.1, se=FALSE)

Here is a plot of the prevalence of each clade by week across 2021.

by_clade <- us_dat %>%
  filter(year(date) == 2021) %>%   ## focus only on 2021 
  mutate(epiweek = epiweek(date)) %>%
  filter(epiweek < 53) %>%        ## values of 53 are at the start of 2021 
  group_by(Nextstrain_clade, epiweek) %>%
  summarize(clade_total = n()) %>%
  group_by(epiweek) %>%
  mutate(total_samples = sum(clade_total),
         pct_clade_in_epiweek = clade_total/total_samples)  

p <- by_clade %>%
  ggplot(aes(x=epiweek, y=pct_clade_in_epiweek, color=Nextstrain_clade)) +
  geom_line()

plotly::ggplotly(p)
010203040500.000.250.500.751.00
Nextstrain_clade19A19B20A20B20C20D20E (EU1)20G20H (Beta, V2)20I (Alpha, V1)20J (Gamma, V3)21A (Delta)21B (Kappa)21C (Epsilon)21D (Eta)21E (Theta)21F (Iota)21G (Lambda)21H (Mu)21I (Delta)21J (Delta)21K (Omicron)21L (Omicron)21M (Omicron)NAepiweekpct_clade_in_epiweek